# Table of Contents

1.  [About](#about)
2.  [Contributions](#contributions)
3.  [Plans](#plans)
    1.  [Using `guile-ffi-cblas`](#using-guile-ffi-cblas)
4.  [Tests](#tests)
5.  [Approach](#approach)
    1.  [Functional programming](#approach-functional-programming)
    2.  [Data representation](#approach-data-representation)
6.  [Known Caveats](#known-caveats)



<a id="about"></a>

# About

(This readme file might not render 100% precisely on many git host websites. If something seems
strange, you can look at the raw version of this file.)

This is a collection of machine learning algorithms implemented in GNU Guile. Currently implemented
algorithms are:

-   decision tree

Nope, sorry, currently that is it.

The decision tree algorithm is implemented in a purely functional way, which helps immensely with
parallelization of the algorithm and unit testing the code. My vision for this repository is, that
more purely functional machine learning algorithms are added later. My preference is strongly for
purely functional algorithms.

Feel free to report bugs, report possible improvements, contribute machine learning algorithms in
GNU Guile or standard Scheme runnable in GNU Guile (perhaps even non-purely functional, so that we
can later on transform them into purely functional ones), fork or use this code in any license
adhering way.


<a id="contributions"></a>

# Contributions

There are many things, that could be done. A few will be listed here, in a non-comprehensive list:

-   Find and report bugs.

-   Add more tests.

-   Implement more machine learning algorithms.
    -   Based on decision trees, random forest could be a next logical step.

-   Come up with an idea for a common interface for algorithms.

-   Improve the code.
    -   Make the code more readable.
    -   Fix performance issues (but not at the cost of readability).

-   Write documentation.
    -   Write beginner friendly documents, for example covering how some algorithm works.
    -   Code documentation: How does this function / procedure work?
    -   Write a usage guide.

-   Make use of the code and share experiences with it.

-   Inform about problems, which prevent usage of the code.

-   Come up with an idea of how to safely implement the ability to limit recursive parallelism to a
    number of cores. Such an idea would be best, if paying attention to the following ideals, if such
    a thing is possible:
    -   avoids global state
    -   avoids locks and mutexes,
    -   is itself limited to a parallelism module, so that it can be exchanged easily
    -   algorithms do not need to know how it internally works

-   Help with setting up packaging for package managers like GUIX.


<a id="plans"></a>

# Plans

The plan is to add more machine learning algorithms. Before implementing anything very complex or
anything with very specific application areas, basic algorithms would probably be implemented:

-   linear regression
-   logistic regression
-   random forest
-   k nearest neighbors
-   k means
-   simple artificial neural networks

Personally I would also like to understand the implementations. Code readability is valued
strongly. Without readable code, I would definitely hesitate to add an algorithm. If a lot of
mathematics is involved, a companion document explaining the mathematical details might be
needed.

Many algorithms lend themselves well to matrix multiplication. A choice would have to be made about
what kind of library to use for that, or at least a basic implementation in Guile to be written,
which, abstracted away cleanly, can act as fallback, for the time, when there is not yet a fast
matrix multiplication library chosen.

Here are some candidates for matrix multiplication:

-   [Schemetran](https://gitlab.com/codetk/schemetran) - apparently a way to write Guile and have it translated to Fortran, which then
    supposedly will be compiled and optimized by a Fortran compiler, which should deliver good
    performance - apparently it also requires, that you know, how to write Fortran code.

-   Perhaps there are some bindings for the [GNU Scientific Library](https://www.gnu.org/software/gsl/)?

-   [guile-ffi-cblas](https://github.com/lloda/guile-ffi-cblas) - seems like exactly what is needed. Perhaps
    <https://www.netlib.org/blas/#_documentation> can enlighten about the weirdly named functions of the
    API?


<a id="using-guile-ffi-cblas"></a>

## Using `guile-ffi-cblas`

When installing Guile via Guix package manager, it is important, that CBLAS or BLIS are also
available in a place, where programs installed via Guix look for it. The easiest way to achieve this
would be to install CBLAS or BLIS through Guix as well. There are currently 3 related packages
available in Guix package repositories: `blis`, `openblas` and `lapack`. In my case I installed all
3 of them:

    guix environment --ad-hoc guile blis openblas lapack

It seems, that using BLIS works best. However, it took almost 2h to build and be checked.

Note the following:

-   The OpenBLAS installed through Guix currently (at 2020-03-19) does not export the symbols or
    functions for rotating matrices. `guile-ffi-cblas` works around this problem.

-   The version of BLIS available from Guix repositories currently (at 2020-03-19) does provide those
    symbols or functions.

In my case that directory where Guix installed programs should look is
`"${HOME}/.guix-profile/lib"`. To generalize, it will probably be the `lib` directory of whatever
Guix profile or environment you installed GNU Guile in.

To tell Guile to look for the `CBLAS` or `BLIS` library there, an environment variable
`LTDL_LIBRARY_PATH` (it stands for libtool dynamic link library path) needs to be set to be the path
of that `lib` directory. The following shell commands show how to run test suites of
`guile-ffi-cblas` for various backends.


### BLIS

    # switch to a temporary directory
    pushd $(mktemp --directory)

    # clone the git repository of guile-ffi-cblas
    git clone https://notabug.org/lloda/guile-ffi-cblas.git

    # switch to cloned git repository
    pushd guile-ffi-cblas

    # create an environment with the required libraries installed
    guix environment --ad-hoc guile blis openblas lapack

    # for BLIS (it is the default of the guile-ffi-cblas library)
    # TODO: Perhaps there is a better way to get the lib folder of the environment?
    LTDL_LIBRARY_PATH="$(which guile | rev | cut --delimiter '/' --fields 3- | rev)/lib" \
    guile -L mod -s test/test-ffi-blis.scm

    # or
    # GUILE_FFI_BLIS_LIBNAME=libblis \
    # GUILE_FFI_BLIS_LIBPATH="$(which guile | rev | cut --delimiter '/' --fields 3- | rev)/lib" \
    # guile -L mod -s test/test-ffi-blis.scm

    popd
    popd


### CBLAS

    # switch to a temporary directory
    pushd $(mktemp --directory)

    # clone the git repository of guile-ffi-cblas
    git clone https://notabug.org/lloda/guile-ffi-cblas.git

    # switch to cloned git repository
    pushd guile-ffi-cblas

    # create an environment with the required libraries installed
    guix environment --ad-hoc guile blis openblas lapack

    # for CBLAS through OpenBLAS
    LTDL_LIBRARY_PATH="$(which guile | rev | cut --delimiter '/' --fields 3- | rev)/lib" \
    GUILE_FFI_CBLAS_LIBNAME=libopenblas \
    guile -L mod -s test/test-ffi-cblas.scm

    # or
    # GUILE_FFI_CBLAS_LIBNAME=libopenblas \
    # GUILE_FFI_CBLAS_LIBPATH="$(which guile | rev | cut --delimiter '/' --fields 3- | rev)/lib" \
    # guile -L mod -s test/test-ffi-cblas.scm

    popd
    popd


### Usage with GNU Guile programs

One probably needs to set the required environment variables, when running a program, which makes
use of `guile-ffi-cblas`, as well, as `guile-ffi-cblas` does:

    GUILE_FFI_BLIS_LIBNAME=libblis \
    GUILE_FFI_BLIS_LIBPATH="$(which guile | rev | cut --delimiter '/' --fields 3- | rev)/lib" \
    guile -L mod -s test/test-ffi-blis.scm


<a id="tests"></a>

# Tests

You can run the tests by using the make file as follows:

    # from the root directory of this project:
    make test

The tests cover a lot of the functionality. Future development should try to maintain this level of
coverage or exceed it.


<a id="approach"></a>

# Approach


<a id="approach-functional-programming"></a>

## Functional programming

In general, the idea is to implement machine learning algorithms in a purely functional way, which
will help with parallelization, testability and avoiding bugs due to mutable state.


<a id="approach-data-representation"></a>

## Data representation

-   A dataset is currently represented by a list of vectors. Rows are represented by vectors.


<a id="known-caveats"></a>

# Known Caveats

-   There is currently no way to limit parallelism. If you need to limit the number of cores this
    program uses, you need to limit the number of cores available to the Guile process.

-   Currently there is no choice of parallelism costructs to use for running the algorithm in
    parallel.

-   Currently there is no way to configure, whether Guile tries to run the algorithm in parallel or
    purely sequentially.

-   There is no concept yet for a common interface to the decision tree algorithm and other possibly
    later added machine learning algoritms.

-   Currently the decision tree algorithm does not make use of any heuristic for selecting split
    feature values. It checks every occurring value of the split feature in the data set. This is
    computationally expensive. However, it should deliver the best, by this algorithm learnable,
    decision tree for specified parameters.

-   Currently there is only one split quality measure implemented. Other implementations often offer
    multiple quality measures.
